The constant arrival of new communication media, the changes in how we communicate, in the format of the information we transmit, and in the general conditions under which communication happens (what we agree to share and what we do not) have made us reconsider what it means to write and read a news story, and what impact it has when it reaches the right place at the right moment.
A special case is the proliferation of new online news outlets. They reach their users through computers and mobile phones, experimenting with new formats that attract audiences and let them tell their journalistic truth. They matter because they work their way into people's daily lives and shape their worldview, both through videos and photos and through articles that easily go viral on platforms such as Facebook or Twitter.
Vox Media, launched in 2014, is a perfect example of this kind of outlet. Its videos and articles cover a very wide range of interests (politics, economics, music, trivia, etc.), which has allowed it to position itself as a trustworthy platform in the eyes of its users, even expanding with a Spanish-language edition. In its own words, Vox's mission is to "explain the news": it wants the reader to "understand what happened" by delivering "contextual information that traditional sites do not usually provide".
This gives us a sense of how much what is NOT said explicitly in these articles matters. Which topics are covered most in each country? From what angle is each topic discussed? How does the treatment of a single topic change over time? These and many other questions led us to choose this dataset: whatever work we do with it will be applicable to any platform and any media outlet, and will be a real contribution toward understanding, in a different way, how today's history is being written.
The dataset we will use, consisting of a varied set of Vox articles, was published on data.world by Elena Zheleva with the goal of "allowing data scientists to apply techniques to a news dataset", which is exactly what this project sets out to do.
Since the attributes of this dataset are few, choosing it means the work to be done is of a different kind: text mining. Within text mining we find the same two broad topics covered in the course, namely clustering and classification. We therefore present below a list of problems and solutions we find interesting to develop; we will then assess how feasible each one is, so that we can finally pick the one that lets us make the most of what we learn in the course without failing in the attempt.
Articles of interest to users: The current classification lets users reach content related to their areas of interest. An interesting problem is to categorize new articles according to the classification built so far (that is, to predict an article's category from its content). This would let content from new authors fit the preferences users have already expressed on the news site, based on their selected areas of interest and on how well new articles map onto those choices. To start the classification process, beyond the initial data cleaning, the different classification algorithms seen in the course (decision trees, KNN, Hunt's algorithm, etc.) should be tried out to see which one gives the best results under our performance metrics.
Exploring news groupings: The current categorization lets users easily find articles on similar topics. Analyzing the "natural" groupings within the universe of articles would provide a tool for adjusting how articles are divided according to Vox's specific characteristics. Given a specific article, similar articles could be suggested based on the clustering rather than on the current rigid classification. Clustering would also let us evaluate the current classification and see whether it correlates with the natural grouping of the articles; some categories currently contain an extremely small number of articles and could perhaps be reorganized to make searching more convenient and efficient for users. The best way to begin this exploration is to try the different clustering algorithms (hierarchical, k-means, distribution-based, etc.) and determine which ones give the most interesting and applicable results, a process that is essentially experimental.
Sentiment analysis: One of the techniques used in text mining is sentiment analysis, which lets us evaluate whether a piece of writing has a positive or negative, happy or sad, tone, among others. By extracting the intent behind the articles of different authors, we can uncover an outlet's editorial line, for example by looking at how different countries and their problems are portrayed over time. The analysis can also be used to estimate the subjectivity or objectivity of an article and relate it to how often this happens in certain categories (to identify the most speculative ones). It is thus a technique that will let us generate a great deal of information from a more "human" reading of the texts. As a starting point, we would try the existing open-source sentiment-analysis tools, such as GATE plugins, the Stanford Sentiment Analysis module, and LingPipe, among others.
Automatic generation of title and blurb: Through topic analysis (extracting keywords from a text) we can obtain an article's contents, and therefore the key points needed to understand it (which, for any respectable news outlet, should appear in its title and blurb, i.e., its promotional subtitle). By training on articles with their respective titles (or even without them), we could find a way to automatically generate the best titles for each article, sparing the author this step. As a starting point, current topic-modeling algorithms (such as LDA, TextRank, etc.) can be tried and adapted to our needs.
Each of these topics will be analyzed as the semester progresses, so that we can settle on the idea that best matches the level of the course, our multidisciplinary abilities, and the time available to exploit our data.
The original dataset has 8 attributes per article: title, author, category, publication date, update date, web address, blurb, and article body. Their original English names are shown below, along with the code used to explore the data, including their content, format, possible errors, NAs, and so on:
library(readr, warn.conflicts=F, quietly=T)
datos <- read_tsv("dsjVoxArticles.tsv")
## Parsed with column specification:
## cols(
## title = col_character(),
## author = col_character(),
## category = col_character(),
## published_date = col_datetime(format = ""),
## updated_on = col_datetime(format = ""),
## slug = col_character(),
## blurb = col_character(),
## body = col_character()
## )
df = as.data.frame(datos) # Convert everything to a data.frame
summary(df)
## title author category
## Length:23022 Length:23022 Length:23022
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## published_date updated_on
## Min. :2014-03-31 14:01:30 Min. :2014-04-07 17:46:29
## 1st Qu.:2015-03-25 14:55:02 1st Qu.:2015-05-05 06:05:03
## Median :2015-12-09 21:10:02 Median :2016-02-15 13:20:18
## Mean :2015-11-25 09:14:55 Mean :2016-01-01 02:49:48
## 3rd Qu.:2016-07-25 12:45:02 3rd Qu.:2016-08-21 13:14:59
## Max. :2017-03-21 23:00:01 Max. :2017-04-11 15:14:24
## NA's :111 NA's :111
## slug blurb body
## Length:23022 Length:23022 Length:23022
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
We can see that the read yields 23,022 rows, whereas according to the web page where the data are published the dataset contains 22,994 records. This means that some rows are not being parsed correctly (this is also reported in a warning when reading the data, which we have chosen to hide because of its length).
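Although the warning was hidden, readr keeps a record of the offending rows; a quick way to inspect them (not shown in the original report) is its problems() helper:
problems(datos)        # rows and columns that did not match the declared column types
nrow(problems(datos))  # how many parsing problems were recorded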
Below we show the first row of the dataset, corresponding to an article about Bitcoin, to illustrate its format:
# TITLE
df[1,1]
## [1] "Bitcoin is down 60 percent this year. Here's why I'm still optimistic."
# AUTHOR
df[1,2]
## [1] "Timothy B. Lee"
# CATEGORY
df[1,3]
## [1] "Business & Finance"
# PUBLICATION DATE
df[1,4]
## [1] "2014-03-31 14:01:30 UTC"
# UPDATE DATE
df[1,5]
## [1] "2014-12-16 16:37:36 UTC"
# URL
df[1,6]
## [1] "http://www.vox.com/2014/3/31/5557170/bitcoin-bad-currency-good-network"
# BLURB (SUBTITLE)
df[1,7]
## [1] "Bitcoins have lost more than 60 percent of their value this year. But their long-term outlook is still bright."
# BODY TEXT
df[1,8]
## [1] "<p>The markets haven't been kind to<span> </span><a href=\"http://www.vox.com/cards/bitcoin/\" style=\"font-size: 17px; line-height: 28.4624996185303px;\">Bitcoin</a> in 2014. The currency reached a high of nearly $1,000 in January before falling to around $350 this month, a plunge of more than 60 percent. It would be easy to write Bitcoin off as a fad whose novelty has worn off.</p> \\n<p>After all, dollars seem superior in almost every respect. T<span>hey're accepted everywhere, they're convenient to use, and they have a stable value. Bitcoin is an inferior currency on all three counts.</span></p> \\n<p><q class=\"right\"><span>Bitcoin's detractors are making the same mistake as many Bitcoin fans</span> </q></p> \\n<p><span>Yet it would be foolish to write Bitcoin off. The currency has had months-long slumps in the past, only to bounce back. </span>More importantly, it's a mistake to think about Bitcoin as a new kind of currency. W<span>hat makes Bitcoin potentially revolutionary is that it's the world's first completely open financial network.</span></p> \\n<p><span>History suggests that open platforms like Bitcoin often become fertile soil for innovation. Think about the internet. It didn't seem like a very practical technology in the 1980s. But it was an open platform that anyone could build on, and in the long run it proved to be really useful.</span></p> \\n<p>The internet succeeded because Silicon Valley have created applications that harness the internet's power while shielding users from its complexity. <span>You don't have to be an expert on the internet's TCP/IP protocols to check Facebook on your iPhone.</span></p> \\n<p>Bitcoin applications can work the same way. There are already some Bitcoin applications that allow customers to make transactions over the Bitcoin network without being exposed to fluctuations in the value of Bitcoin's currency. That basic model should work for a wide variety of Bitcoin-based services, allowing the Bitcoin payment network to reach a mainstream audience.</p> \\n<p><img src=\"http://cdn2.vox-cdn.com/assets/4215235/imp.jpg\" class=\"photo\" alt=\"Imp\"></p> \\n<p><span><span></span></span></p> \\n<p class=\"caption\">This is the very first node on the ARPANET, the predecessor to the Internet that launched in 1969. (Flickr/<a href=\"https://www.flickr.com/photos/fastlizard4/6294438012/\">FastLizard4</a>)</p> \\n<p><span></span></p> \\n<h3><span>The first open financial network</span></h3> \\n<p>The Bitcoin network serves the same purpose as mainstream payment networks such as Visa or Western Union. But there's an important difference. The Visa and Western Union networks are owned and operated by for-profit companies. If you want to build a business based on one of those networks, you have to get permission from the owner.</p> \\n<p>And that's not always easy. To use the Visa network, for example, you have to comply with <a href=\"https://usa.visa.com/download/merchants/visa-international-operating-regulations-main.pdf\">hundreds of pages</a> of regulations. The Visa network also has high fees, and there are some things Visa won't let you do on its network at all.</p> \\n<p>Bitcoin is different. Because no one owns or controls the network, there are no limits on how people can use it. Some people have used that freedom to do illegal things like buying drugs or gambling online. But it also means there's a low barrier to entry for building new Bitcoin-based financial services.</p> \\n<p>There's an obvious parallel to the internet. 
Before the internet became mainstream, the leading online services were commercial networks like Compuserve and Prodigy. The companies that ran the network decided what services would be available on them.</p> \\n<p><span>In contrast, the internet was designed for anyone to create new services. Tim Berners-Lee didn't need to ask anyone's permission to create the world wide web. He simply wrote the first web browser and web server and posted them online for others to download. Soon thousands of people were using the software and the web was born.</span></p> \\n<h3>Finding Bitcoin's killer app</h3> \\n<p>So what will people do with Bitcoin? It's hard to predict tomorrow's innovations, but we can get some idea of Bitcoin's potential by thinking about weaknesses of the conventional financial system.</p> \\n<p><q class=\"left\"><span>Bitcoin is such a good deal for merchants that they may find it worthwhile to offer their customers discounts for paying with Bitcoin instead of cash or credit cards</span></q></p> \\n<p>One obvious application is international money transfers. Companies like Western Union and Moneygram can charge as much as 8 percent to transfer cash from one country to another, and transfers can take as long as 3 days to complete. In contrast, Bitcoin transactions only take about 30 minutes to clear, and Bitcoin transaction fees could be a lot less than 8 percent.</p> \\n<p>People have been <a target=\"_blank\" href=\"http://www.vox.com/2014/11/1/7139785/there-are-now-285-bitcoin-atms-around-the-world\">building Bitcoin ATMs</a> to let people convert between bitcoins and their local currency. The first Bitcoin ATM was launched a little over a year ago. Today, there are <a target=\"_blank\" href=\"http://coinatmradar.com/\">329 of them</a>.</p> \\n<p><span>If these devices continue to proliferate, they could become a useful alternative to conventional money-transfer services. Currently, each machine charges a transaction fee of around 3 percent, so the total cost of transferring money from one Bitcoin ATM to another is around 6 percent. That's comparable to the fees charged by incumbent money transfer services, and competition is likely to push down Bitcoin ATM fees over time.</span></p> \\n<p>A more ambitious application for Bitcoin would be as an alternative to credit cards for daily purchases. Startups such Bitpay have already figured out how to make Bitcoin attractive to merchants as a way of accepting payments. Credit card networks charge merchants around 3 percent to process transactions. Bitpay charges 1 percent or less to accept Bitcoin payments on behalf of merchants. Bitpay merchants don't have to worry about the headache of disputed payments known as \"chargebacks.\"</p> \\n<p><span>Of course, for Bitcoin to take off as an alternative to credit cards, consumers will have to start using them regularly. And that's going to be a hard sell, especially if consumers are exposed to the risk of Bitcoin's volatility.</span></p> \\n<p>But a Bitcoin-based payment app could also have some advantages. One is security. The current credit card network essentially works on the honor system, allowing any merchant to charge a credit card and relying on after-the-fact adjudication to police fraud. Bitcoin could allow companies to experiment with alternative approaches that build in security at the front end, for example by asking users to confirm a transaction on their smartphones before it's approved. 
That could cut fraud, reducing the hassle of disputing fraudulent payments and allowing lower fees.</p> \\n<p>Moreover, Bitcoin is such a good deal for merchants that they may find it worthwhile to offer their customers discounts for paying with Bitcoin instead of cash or credit cards. That might entice bargain-hunting consumers who aren't otherwise interested in trying a new payment technology.</p> \\n<p><img src=\"http://cdn0.vox-cdn.com/assets/4215267/6355318323_4c41d3ef76_b.jpg\" class=\"photo\" alt=\"6355318323_4c41d3ef76_b\"></p> \\n<p class=\"caption\"><a href=\"https://www.flickr.com/photos/68751915@N05/6355318323/sizes/l\" style=\"font-size: 12px; font-style: italic; line-height: 15px;\">401(K) 2013</a></p> \\n<h3><span>Using Bitcoin-the-network without Bitcoin-the-currency</span></h3> \\n<p>The biggest stumbling block for many Bitcoin services is Bitcoin's volatility. The current generation of Bitcoin \"wallet\" apps, which store bitcoins on behalf of users, expose consumers to fluctuations in Bitcoin's value. Ordinary consumers are unlikely to ever be comfortable with a payment system where their wealth can shrink by 10 percent or more in a single day.</p> \\n<p>Fortunately, it's possible to design Bitcoin-based financial services that don't expose users to fluctuations in Bitcoin's value. Bitpay is a good example. Bitpay merchants set prices in conventional currencies such as the dollar, converting to the equivalent number of Bitcoins at the time of sale. Once the sale is made, Bitpay immediately converts it to an equivalent number of dollars and deposits the cash in the merchant's conventional bank account. This means that from the merchant's perspective, Bitpay is just another way of accepting dollars. A Bitpay merchant isn't affected at all by fluctuations in the value of Bitcoin.</p> \\n<p>The same principle could apply to any other Bitcoin-based service. A consumer-friendly payments app could store a user's cash in dollars, converting them to bitcoins at the time of payment. Under the hood the app could use the full power of the Bitcoin platform, but from the user's perspective it would just seem like another way of paying for things with dollars.</p> \\n<p>You're probably wondering: if Bitcoin's value is as a payment network, why not just build an open payment network based on a conventional currency like the dollar? An open, dollar-based payment network would be a great idea. The problem is that no one has figured out how to build one.</p> \\n<p>Dollars in the Paypal network are worth a dollar because the Paypal company has promised to honor withdrawal requests. But there's no organization to perform this role on a peer-to-peer network like Bitcoin.</p> \\n<p>Suppose there were a network called Dollarcoin that worked exactly like Bitcoin except a company called Dollarcoin Inc. promised to convert dollarcoins into dollars. Then the value of one dollarcoin would always equal one dollar. But as the manager of the Dollarcoin network, the Dollarcoin company would face pressure to comply with a variety of laws regarding fraud, money laundering, and so forth. To keep the costs of complying with those requirements under control, it would be forced to regulate who could use the network and how. (Indeed, that's exactly what <a href=\"http://www.amazon.com/The-PayPal-Wars-Battles-Planet/dp/1936488590\">happened to Paypal</a> 15 years ago.) 
Over time, the Dollarcoin network could become as restrictive as conventional financial networks.</p> \\n<p>Bitcoin's openness depends on the fact that no one owns the network. And with no owner, there's no one to guarantee that bitcoins have a predictable value.</p> \\n<p>Many of Bitcoin's early adopters were acolytes of Ron Paul's brand of hard-money libertarianism. They were attracted to the promise of a currency whose supply was outside of state control, and as a consequence, Bitcoin has gained a reputation as the second coming of the gold standard. That, in turn, has made mainstream economists who are hostile to Ron Paul and the gold standard hostile to Bitcoin.</p> \\n<p>But in reality, the case for Bitcoin simply doesn't have much to do with its unorthodox monetary policy. Bitcoin is a payment network that happens to have its own currency, not the other way around. It's worth taking seriously whether or not you agree with Ron Paul's views on the Federal Reserve.</p>"
Two things stand out: 1. the data that really matter for a mining task are those in the article body (column 8), and 2. the body contains markup used to format the article on the web page (HTML code) as well as unintelligible special characters, probably introduced when converting from one format to another. These commands and characters will have to be filtered out before we can do a deep, meaningful analysis of the words, phrases, and meaning found in the text.
Below are the commands executed to explore and identify the different errors (and how many of them) are present in the data:
# How many articles have a blank title
sum(is.na(df[,1]))
## [1] 0
# How many articles have no author
sum(is.na(df[,2]))
## [1] 14
# How many articles have no category
sum(is.na(df[,3]))
## [1] 111
# How many articles have no publication date
sum(is.na(df[,4]))
## [1] 111
# How many articles have no blurb
sum(is.na(df[,7]))
## [1] 2830
# Number of articles with no body
sum(is.na(datos[,8]))
## [1] 111
Note that some of these attributes are of no use to our project (author, dates, blurb) and others matter but are not directly relevant to our experiments (title). The counts we most care about reducing are therefore the number of articles without a body and without a category (whether due to an error when the data were put together or simply because the value is missing).
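As a quick check of the two counts that matter most to us, the articles that do have both a body and a category can be counted directly (a small sketch, not part of the original exploration):
# Articles with both a non-missing body and a non-missing category
sum(!is.na(df$body) & !is.na(df$category))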
For this analysis, we count the total number of rows that have no missing values (NA) in any of their attributes:
sumadeNA <- rowSums(is.na(df))       # number of NAs per row
df_na <- cbind(df, n_na = sumadeNA)  # attach that count as a new column
sum(df_na$n_na == 0)                 # rows with no missing values at all
## [1] 20180
From this we can conclude, preliminarily, that we have a total of 20,180 rows with no missing values and roughly 2,800 rows with problems.
Let us see how many distinct categories exist, and what topics they cover:
unique(df[,3])
## [1] "Business & Finance"
## [2] "War on Drugs"
## [3] "Criminal Justice"
## [4] "Health Care"
## [5] "Explainers"
## [6] "Life"
## [7] "Science & Health"
## [8] "Neuroscience"
## [9] "Apple"
## [10] "Politics & Policy"
## [11] "Culture"
## [12] "Human Rights"
## [13] "The Latest"
## [14] "World"
## [15] "Marriage Equality"
## [16] "Almanac"
## [17] "Transportation"
## [18] "Space"
## [19] NA
## [20] "Emmy Awards"
## [21] "Xpress"
## [22] "Identities"
## [23] "Marijuana Legalization"
## [24] "Joe Biden"
## [25] "Star Wars"
## [26] "Sports"
## [27] "North Korea"
## [28] "On Instagram"
## [29] "Race in America"
## [30] "Media"
## [31] "Education"
## [32] "Gender-Based Violence"
## [33] "Supreme Court"
## [34] "Gender Equality"
## [35] "Orange Is the New Black"
## [36] "Immigration"
## [37] "Hillary Clinton"
## [38] "Gun Violence"
## [39] "Politics"
## [40] "ISIS"
## [41] "NFL"
## [42] "Science of Everyday Life"
## [43] "Infectious Disease"
## [44] "Congress"
## [45] "College Football"
## [46] "Campaign Finance"
## [47] "Books"
## [48] "Vox"
## [49] "Music"
## [50] "2016 Presidential Election"
## [51] "LGBTQ"
## [52] "Interviews"
## [53] "Ted Cruz"
## [54] "Energy & Environment"
## [55] "Genetics"
## [56] "True Detective"
## [57] "Officer-Involved Shootings"
## [58] "Jeb Bush"
## [59] "Israel-Palestine Conflict"
## [60] "The Big Idea"
## [61] "MTV VMAs"
## [62] "Religion"
## [63] "Comic Books"
## [64] "Labor Market"
## [65] "Obamacare"
## [66] "Global Governance"
## [67] "Television"
## [68] "Movies"
## [69] "Natural Disasters"
## [70] "Syria"
## [71] "Celebrities"
## [72] "Technology"
## [73] "Russia"
## [74] "Best of 2014"
## [75] "Poverty"
## [76] "Economic Mobility"
## [77] "2014 Midterm Elections"
## [78] "Ebola"
## [79] "Scotland"
## [80] "Climate Change"
## [81] "Net Neutrality"
## [82] "Features"
## [83] "First Person"
## [84] "Maps"
## [85] "Cuba"
## [86] "Hollywood"
## [87] "True Detective, Season 2"
## [88] "NSA"
## [89] "Mike Pence"
## [90] "Social Policy"
## [91] "China"
## [92] "Gilmore Girls"
## [93] "Marvel"
## [94] "Vox Sentences"
## [95] "Voting Rights"
## [96] "Avengers: Age of Ultron"
## [97] "Oscars"
## [98] "Serial"
## [99] "Internet Security"
## [100] "Bernie Sanders"
## [101] "Telecoms"
## [102] "2016 Rio Olympics"
## [103] "Gift Guides"
## [104] "Fear the Walking Dead"
## [105] "2016 Golden Globes"
## [106] "Mad Men"
## [107] "The Americans"
## [108] "2016 Grammys"
## [109] "Game of Thrones"
## [110] "Video"
## [111] "Reviews"
## [112] "Marco Rubio"
## [113] "Donald Trump"
## [114] "Mad Men, season 7"
## [115] "Mad Men, season 7, episode 8"
## [116] "Grist"
## [117] "Game of Thrones, season 5, episode 1"
## [118] "Mad Men, season 7, episode 9"
## [119] "Game of Thrones, season 5, episode 2"
## [120] "Mad Men, season 7, episode 10"
## [121] "Mad Men, season 7, episode 11"
## [122] "Game of Thrones, season 5, episode 3"
## [123] "Game of Thrones, season 5, episode 4"
## [124] "Mad Men, season 7, episode 12"
## [125] "Carly Fiorina"
## [126] "Game of Thrones, season 6"
## [127] "Mad Men, season 7, episode 13"
## [128] "Small Business"
## [129] "Mad Men, season 7, episode 14"
## [130] "Hannibal, season 3"
## [131] "One Change to Save the World"
## [132] "Game of Thrones, season 5, episode 10"
## [133] "Debates"
## [134] "Hate Crimes"
## [135] "True Detective, Season 2, Episode 1"
## [136] "Mr. Robot"
## [137] "True Detective, Season 2, Episode 2"
## [138] "True Detective Season 2, Episode 3"
## [139] "True Detective, Season 2, Episode 4"
## [140] "Ant-Man"
## [141] "True Detective, Season 2, Episode 5"
## [142] "True Detective Season 2, Episode 6"
## [143] "True Detective Season 2, Episode 7"
## [144] "Dear Julia"
## [145] "UnReal"
## [146] "True Detective Season 2, Episode 8"
## [147] "Show Me the Evidence"
## [148] "Polyarchy"
## [149] "Fear the Walking Dead, season 1, episode 1"
## [150] "Fear the Walking Dead, season 1, episode 2"
## [151] "Mischiefs of Faction"
## [152] "Fear the Walking Dead, Season 1"
## [153] "Super Bowl 51"
## [154] "Episode of the Week"
## [155] "Conversations"
## [156] "Podcasts"
## [157] "The Walking Dead"
## [158] "On Snapchat"
## [159] "Making a Murderer"
## [160] "2016ish"
## [161] "Game of Thrones, season 6, episode 1"
## [162] "Game of Thrones, season 6, episode 2"
## [163] "Game of Thrones season 6, episode 3"
## [164] "Game of Thrones, season 6, episode 4"
## [165] "Game of Thrones, season 6, episode 5"
## [166] "Game of Thrones, season 6, episode 6"
## [167] "Game of Thrones, season 6, episode 7"
## [168] "Game of Thrones, season 6, episode 9"
## [169] "Internet Culture"
## [170] "Game of Thrones, season 6, episode 10"
## [171] "The Night Of"
## [172] "New Money"
## [173] "Westworld, season 1"
## [174] "Movie of the Week"
## [175] "Black Mirror, Season 3"
## [176] "Weeds in the Wild"
## [177] "Policy"
## [178] "Best of 2016"
## [179] "Pentagon"
## [180] "Terrorism"
## [181] "Obama Administration"
## [182] "Hannibal"
## [183] "Strikethrough"
## [184] "I Think You're Interesting"
## [185] "Social Programs"
## [186] "Reproductive Health"
There are 185 distinct categories (186 unique values counting NA), among them very specific ones referring to particular TV series, year-end round-ups, individual people, and so on. An analysis of the kinds of categories in use is needed to determine whether this attribute is especially meaningful within the VOX site and, if it is not, to draft a proposal, based on the results of this project, for better ways of classifying the articles.
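A natural first step for that analysis (a sketch, not run in the original report) is simply counting how many articles each category contains:
cat_counts <- sort(table(df$category), decreasing = TRUE)
head(cat_counts)  # the most used categories
tail(cat_counts)  # categories with very few articles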
Having examined the data, we decided that the projects to pursue in this first stage are: natural grouping of articles via clustering, and sentiment analysis of articles.
The first aims to explore the possibilities of clustering features extracted from text, in order to find similarities (obvious at first glance or not) between articles, group them, and generate categories intrinsically tied to the topics Vox writes about.
The second aims to find new relationships between how articles are written and who writes them, that is, how the portrayal of countries, conflicts, laws, and anything else being written about changes when the observer (in this case the writer) has an ideological bias (their own view of what is right or wrong). The intention is to later extend this work to other outlets and observe how both the sentiment and the style of reporting vary, thereby identifying the context of the outlet we are getting our information from. For that, we need to identify and understand the key elements that characterize literary intent, and the tools that let us carry out this work.
From all of this we derived our main research questions and hypotheses.
The experiments for the natural grouping of articles consisted of exploring the possibilities offered by R's tm library ("text mining"), which provides many tools for working with text. We followed the tutorial listed in the references. Because the data (the article bodies) needed more preprocessing before the library could handle them, we could not obtain concrete results at this stage.
However, some of the possibilities offered by the tm library are:
+ Automatic cleaning of the text
+ Defining a notion of distance between two articles
+ Grouping the articles according to that distance
We first ran an exploratory analysis of the first row, applying a cleaning process that strips the HTML elements from the body, the line breaks, and the special characters (keeping only alphanumeric characters and standard punctuation, and removing dollar signs):
library(RCurl)
library(XML)
library(wordcloud)
# Parse the body of the first article as HTML
doc <- htmlParse(df[1,8], asText = TRUE)
# Keep only the visible text nodes (skipping scripts, styles, noscript and forms)
text <- xpathSApply(doc, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]", xmlValue)
# Drop the nodes that are just line breaks
cond <- lapply(text, function(x) x != " \\n")
text <- text[unlist(cond)]
# Collapse into a single string, replace non-ASCII characters, and drop dollar signs
res <- paste(text, collapse = " ")
res <- gsub("[^a-zA-Z0-9[:punct:]]", " ", res)
fileText <- gsub("\\$", "", res)
This yields the following raw text, which can be compared with the version shown in the data summary section:
fileText
## [1] "The markets haven't been kind to Bitcoin in 2014. The currency reached a high of nearly 1,000 in January before falling to around 350 this month, a plunge of more than 60 percent. It would be easy to write Bitcoin off as a fad whose novelty has worn off. After all, dollars seem superior in almost every respect. T hey're accepted everywhere, they're convenient to use, and they have a stable value. Bitcoin is an inferior currency on all three counts. Bitcoin's detractors are making the same mistake as many Bitcoin fans Yet it would be foolish to write Bitcoin off. The currency has had months-long slumps in the past, only to bounce back. More importantly, it's a mistake to think about Bitcoin as a new kind of currency. W hat makes Bitcoin potentially revolutionary is that it's the world's first completely open financial network. History suggests that open platforms like Bitcoin often become fertile soil for innovation. Think about the internet. It didn't seem like a very practical technology in the 1980s. But it was an open platform that anyone could build on, and in the long run it proved to be really useful. The internet succeeded because Silicon Valley have created applications that harness the internet's power while shielding users from its complexity. You don't have to be an expert on the internet's TCP/IP protocols to check Facebook on your iPhone. Bitcoin applications can work the same way. There are already some Bitcoin applications that allow customers to make transactions over the Bitcoin network without being exposed to fluctuations in the value of Bitcoin's currency. That basic model should work for a wide variety of Bitcoin-based services, allowing the Bitcoin payment network to reach a mainstream audience. This is the very first node on the ARPANET, the predecessor to the Internet that launched in 1969. (Flickr/ FastLizard4 ) The first open financial network The Bitcoin network serves the same purpose as mainstream payment networks such as Visa or Western Union. But there's an important difference. The Visa and Western Union networks are owned and operated by for-profit companies. If you want to build a business based on one of those networks, you have to get permission from the owner. And that's not always easy. To use the Visa network, for example, you have to comply with hundreds of pages of regulations. The Visa network also has high fees, and there are some things Visa won't let you do on its network at all. Bitcoin is different. Because no one owns or controls the network, there are no limits on how people can use it. Some people have used that freedom to do illegal things like buying drugs or gambling online. But it also means there's a low barrier to entry for building new Bitcoin-based financial services. There's an obvious parallel to the internet. Before the internet became mainstream, the leading online services were commercial networks like Compuserve and Prodigy. The companies that ran the network decided what services would be available on them. In contrast, the internet was designed for anyone to create new services. Tim Berners-Lee didn't need to ask anyone's permission to create the world wide web. He simply wrote the first web browser and web server and posted them online for others to download. Soon thousands of people were using the software and the web was born. Finding Bitcoin's killer app So what will people do with Bitcoin? 
It's hard to predict tomorrow's innovations, but we can get some idea of Bitcoin's potential by thinking about weaknesses of the conventional financial system. Bitcoin is such a good deal for merchants that they may find it worthwhile to offer their customers discounts for paying with Bitcoin instead of cash or credit cards One obvious application is international money transfers. Companies like Western Union and Moneygram can charge as much as 8 percent to transfer cash from one country to another, and transfers can take as long as 3 days to complete. In contrast, Bitcoin transactions only take about 30 minutes to clear, and Bitcoin transaction fees could be a lot less than 8 percent. People have been building Bitcoin ATMs to let people convert between bitcoins and their local currency. The first Bitcoin ATM was launched a little over a year ago. Today, there are 329 of them . If these devices continue to proliferate, they could become a useful alternative to conventional money-transfer services. Currently, each machine charges a transaction fee of around 3 percent, so the total cost of transferring money from one Bitcoin ATM to another is around 6 percent. That's comparable to the fees charged by incumbent money transfer services, and competition is likely to push down Bitcoin ATM fees over time. A more ambitious application for Bitcoin would be as an alternative to credit cards for daily purchases. Startups such Bitpay have already figured out how to make Bitcoin attractive to merchants as a way of accepting payments. Credit card networks charge merchants around 3 percent to process transactions. Bitpay charges 1 percent or less to accept Bitcoin payments on behalf of merchants. Bitpay merchants don't have to worry about the headache of disputed payments known as \"chargebacks.\" Of course, for Bitcoin to take off as an alternative to credit cards, consumers will have to start using them regularly. And that's going to be a hard sell, especially if consumers are exposed to the risk of Bitcoin's volatility. But a Bitcoin-based payment app could also have some advantages. One is security. The current credit card network essentially works on the honor system, allowing any merchant to charge a credit card and relying on after-the-fact adjudication to police fraud. Bitcoin could allow companies to experiment with alternative approaches that build in security at the front end, for example by asking users to confirm a transaction on their smartphones before it's approved. That could cut fraud, reducing the hassle of disputing fraudulent payments and allowing lower fees. Moreover, Bitcoin is such a good deal for merchants that they may find it worthwhile to offer their customers discounts for paying with Bitcoin instead of cash or credit cards. That might entice bargain-hunting consumers who aren't otherwise interested in trying a new payment technology. 401(K) 2013 Using Bitcoin-the-network without Bitcoin-the-currency The biggest stumbling block for many Bitcoin services is Bitcoin's volatility. The current generation of Bitcoin \"wallet\" apps, which store bitcoins on behalf of users, expose consumers to fluctuations in Bitcoin's value. Ordinary consumers are unlikely to ever be comfortable with a payment system where their wealth can shrink by 10 percent or more in a single day. Fortunately, it's possible to design Bitcoin-based financial services that don't expose users to fluctuations in Bitcoin's value. Bitpay is a good example. 
Bitpay merchants set prices in conventional currencies such as the dollar, converting to the equivalent number of Bitcoins at the time of sale. Once the sale is made, Bitpay immediately converts it to an equivalent number of dollars and deposits the cash in the merchant's conventional bank account. This means that from the merchant's perspective, Bitpay is just another way of accepting dollars. A Bitpay merchant isn't affected at all by fluctuations in the value of Bitcoin. The same principle could apply to any other Bitcoin-based service. A consumer-friendly payments app could store a user's cash in dollars, converting them to bitcoins at the time of payment. Under the hood the app could use the full power of the Bitcoin platform, but from the user's perspective it would just seem like another way of paying for things with dollars. You're probably wondering: if Bitcoin's value is as a payment network, why not just build an open payment network based on a conventional currency like the dollar? An open, dollar-based payment network would be a great idea. The problem is that no one has figured out how to build one. Dollars in the Paypal network are worth a dollar because the Paypal company has promised to honor withdrawal requests. But there's no organization to perform this role on a peer-to-peer network like Bitcoin. Suppose there were a network called Dollarcoin that worked exactly like Bitcoin except a company called Dollarcoin Inc. promised to convert dollarcoins into dollars. Then the value of one dollarcoin would always equal one dollar. But as the manager of the Dollarcoin network, the Dollarcoin company would face pressure to comply with a variety of laws regarding fraud, money laundering, and so forth. To keep the costs of complying with those requirements under control, it would be forced to regulate who could use the network and how. (Indeed, that's exactly what happened to Paypal 15 years ago.) Over time, the Dollarcoin network could become as restrictive as conventional financial networks. Bitcoin's openness depends on the fact that no one owns the network. And with no owner, there's no one to guarantee that bitcoins have a predictable value. Many of Bitcoin's early adopters were acolytes of Ron Paul's brand of hard-money libertarianism. They were attracted to the promise of a currency whose supply was outside of state control, and as a consequence, Bitcoin has gained a reputation as the second coming of the gold standard. That, in turn, has made mainstream economists who are hostile to Ron Paul and the gold standard hostile to Bitcoin. But in reality, the case for Bitcoin simply doesn't have much to do with its unorthodox monetary policy. Bitcoin is a payment network that happens to have its own currency, not the other way around. It's worth taking seriously whether or not you agree with Ron Paul's views on the Federal Reserve."
Next we applied a process called tokenization (extracting the words of a text) and counted these tokens in order to draw a word cloud, which, through the words (excluding stop words, i.e., function words) and their frequencies, gives an overview of the topic being discussed. The following code was used:
library(tidytext)
library(tidyverse)
library(glue)
library(stringr)
tokens <- data_frame(text = fileText) %>% unnest_tokens(word, text)
tokens %>% count(word, sort = TRUE)
## # A tibble: 575 x 2
## word n
## <chr> <int>
## 1 the 88
## 2 to 76
## 3 bitcoin 47
## 4 a 44
## 5 of 42
## 6 and 24
## 7 network 23
## 8 in 22
## 9 that 22
## 10 as 20
## # ... with 565 more rows
tokens %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 100))
We can clearly see that the topic revolves around Bitcoin and related concepts (network, currency, payments, services), which is a good sign for the future extraction of the text's topic. Next we run a sentiment analysis of the text by comparing the tokens against a sentiment lexicon, a dictionary of words expressing negativity or positivity, which will let us assess the writer's intent.
tokens %>%
inner_join(get_sentiments("bing")) %>% # pull out only sentiment words
count(sentiment) %>% # count the # of positive & negative words
spread(sentiment, n, fill = 0) %>% # made data wide rather than narrow
mutate(sentiment = positive - negative) # # of positive words - # of negative words
## # A tibble: 1 x 3
## negative positive sentiment
## <dbl> <dbl> <dbl>
## 1 30 62 32
Finally, the code shows that the text speaks positively about Bitcoin (sentiment being the number of positive words minus the number of negative ones), but this is just one particular case: we need to extend the approach both to extract the topic itself (since the most frequent term is not necessarily the topic) and to cover every other row of our dataset.
For the clustering experiments we followed the steps described in the tutorial on the following site, adapting them to the data used in the project.
Unfortunately, because of the sheer volume of data, we could not use all of it and had to limit the experiment to only 1,000 rows. In the code shown below, the first 1,000 rows were chosen. The results do not change much if the sample is chosen differently (for example, another interval, or picking rows at random).
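For instance, an equally sized random sample could be drawn as follows (a hypothetical variation; the seed and variable name are our own):
set.seed(2017)                                        # assumed seed, only for reproducibility
random_bodies <- df[sample(nrow(df), 1000), 8, drop = FALSE]  # 1000 article bodies chosen at random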
The following code cleans the data to be used and, at the same time, separates out the 1,000 rows:
library(RCurl)
library(XML)
library(wordcloud)
bodies <- df[1:1000,8,drop=FALSE]
for (i in 1:1000){
doc <- htmlParse(bodies[i,1], asText = TRUE)
text <- xpathSApply(doc, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]", xmlValue)
cond <- lapply(text, function(x) x != " \\n")
text <- text[unlist(cond)]
res <- paste(text, collapse = " ")
res <- gsub("[^a-zA-Z0-9[:punct:]]", " ", res)
bodies[i,1] <- gsub("\\$", "", res)
}
bodies <- bodies[["body"]]
This line creates a corpus object, which is used for the clustering. It basically holds the data and lets us process it, for instance for cleaning and other operations.
corpus = tm::Corpus(tm::VectorSource(bodies))
The following lets us store the data in numeric form.
tdm <- tm::DocumentTermMatrix(corpus)
tdm.tfidf <- tm::weightTfIdf(tdm)
## Warning in tm::weightTfIdf(tdm): empty document(s): 65 409 505
This computes the "distance" between any two articles, which some clustering algorithms require.
tfidf.matrix <- as.matrix(tdm.tfidf)
# Cosine distance, used by some clustering algorithms
dist.matrix = proxy::dist(tfidf.matrix, method = "cosine")
These are the clusterings themselves, using 3 different techniques.
clustering.kmeans <- kmeans(tfidf.matrix, centers=10, nstart=3)
clustering.hierarchical <- hclust(dist.matrix, method = "ward.D2")
clustering.dbscan <- dbscan::hdbscan(dist.matrix, minPts = 10)
Having the 3 clusterings, they are combined using a master-cluster / slave-clusters technique. Finally, a clustering that combines all three is generated.
master.cluster <- clustering.kmeans$cluster
slave.hierarchical <- cutree(clustering.hierarchical, k = 5)
slave.dbscan <- clustering.dbscan$cluster
stacked.clustering <- rep(NA, length(master.cluster))
names(stacked.clustering) <- 1:length(master.cluster)
for (cluster in unique(master.cluster)) {
indexes = which(master.cluster == cluster, arr.ind = TRUE)
slave1.votes <- table(slave.hierarchical[indexes])
slave1.maxcount <- names(slave1.votes)[which.max(slave1.votes)]
slave1.indexes = which(slave.hierarchical == slave1.maxcount, arr.ind = TRUE)
slave2.votes <- table(slave.dbscan[indexes])
slave2.maxcount <- names(slave2.votes)[which.max(slave2.votes)]
stacked.clustering[indexes] <- slave2.maxcount
}
Plotting the clusters.
points <- cmdscale(dist.matrix, k = 2)
palette <- colorspace::diverge_hcl(5) # Creating a color palette
previous.par <- par(mfrow=c(2,2), mar = rep(1.5, 4))
plot(points, main = 'K-Means clustering', col = as.factor(master.cluster),
mai = c(0, 0, 0, 0), mar = c(0, 0, 0, 0),
xaxt = 'n', yaxt = 'n', xlab = '', ylab = '')
plot(points, main = 'Hierarchical clustering', col = as.factor(slave.hierarchical),
mai = c(0, 0, 0, 0), mar = c(0, 0, 0, 0),
xaxt = 'n', yaxt = 'n', xlab = '', ylab = '')
plot(points, main = 'Density-based clustering', col = as.factor(slave.dbscan),
mai = c(0, 0, 0, 0), mar = c(0, 0, 0, 0),
xaxt = 'n', yaxt = 'n', xlab = '', ylab = '')
plot(points, main = 'Stacked clustering', col = as.factor(stacked.clustering),
mai = c(0, 0, 0, 0), mar = c(0, 0, 0, 0),
xaxt = 'n', yaxt = 'n', xlab = '', ylab = '')
par(previous.par) # recovering the original plot space parameters
Following the professor's recommendation, we first gave our project a representative title and then chose these two sub-projects for this testing stage. The preliminary results indicate that our dataset is well suited to the tools that have already been developed and proven for this kind of work, and that our task will clearly be to adapt and focus them on the results we are after, namely identifying topics accurately and efficiently, and evaluating sentiment with a high hit rate.
We still have important data-cleaning steps ahead, such as removing the meta-articles, fixing the conflicts caused by missing columns, and resolving the overlap between columns, among others. Nevertheless, we are getting closer to having the dataset fully prepared for any tool we consider relevant, which will then let us focus on choosing one of these sub-projects, whichever proves most meaningful and interesting for our learning goals, and invest the time in a more exhaustive test of text-mining tools and in modifying them in a way that engages more directly with how the algorithms themselves work.
Based on these experiments, the data obtained, and this first exploration of the algorithms, the future directions we expect to pursue are described below.
After studying the different possibilities of both lines of work, we decided on a path that unifies both processes (grouping and sentiment analysis), which is ultimately the direction the work for this Hito 3 (milestone 3) took, and which we detail below.
Following the steps and the code from this site, we ran topic analysis on the same 1,000 rows taken earlier. Note that these rows were chosen arbitrarily. The code is shown below.
library(tm)
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
##
## annotate
library(wordcloud)
library(slam)
library(topicmodels)
tdm = DocumentTermMatrix(corpus) # Creating a Term document Matrix
# create tf-idf matrix
term_tfidf <- tapply(tdm$v/row_sums(tdm)[tdm$i], tdm$j, mean) * log2(nDocs(tdm)/col_sums(tdm > 0))
summary(term_tfidf)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0005058 0.0084669 0.0123339 0.0199710 0.0206027 0.7212081
tdm <- tdm[,term_tfidf >= 0.1]
tdm <- tdm[row_sums(tdm) > 0,]
summary(col_sums(tdm))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 2.00 5.00 10.97 15.00 190.00
#Deciding best K value using Log-likelihood method
best.model <- lapply(seq(2, 50, by = 1), function(d){LDA(tdm, d)})
best.model.logLik <- as.data.frame(as.matrix(lapply(best.model, logLik)))
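# Hypothetical extension (not used below, where k is simply fixed at 50):
# pick the number of topics that maximizes the fitted log-likelihood.
best.k <- seq(2, 50, by = 1)[which.max(unlist(best.model.logLik))]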
#calculating LDA
k = 50; # number of topics
SEED = 786;
CSC_TM <-list(VEM = LDA(tdm, k = k, control = list(seed = SEED)),VEM_fixed = LDA(tdm, k = k,control = list(estimate.alpha = FALSE, seed = SEED)),Gibbs = LDA(tdm, k = k, method = "Gibbs",control = list(seed = SEED, burnin = 1000,thin = 100, iter = 1000)),CTM = CTM(tdm, k = k,control = list(seed = SEED,var = list(tol = 10^-4), em = list(tol = 10^-3))))
sapply(CSC_TM[1:2], slot, "alpha")
## VEM VEM_fixed
## 0.004439402 1.000000000
sapply(CSC_TM, function(x) mean(apply(posterior(x)$topics, 1, function(z) - sum(z * log(z)))))
## VEM VEM_fixed Gibbs CTM
## 0.4315222 3.5934966 3.6124053 0.6740875
Topic <- topics(CSC_TM[["VEM"]], 1)
Terms <- terms(CSC_TM[["VEM"]], 8)
Terms
## Topic 1 Topic 2 Topic 3 Topic 4
## [1,] "ora" "assad" "pardon" "truvada"
## [2,] "ashanti" "sunscreen" "commutations" "strava"
## [3,] "sanneh" "easter" "pardons" "fob"
## [4,] "poem" "mites" "gaba" "cox"
## [5,] "salad" "maliki" "mitt" "parchman"
## [6,] "kickstarter" "melanoma" "romney" "indulgent"
## [7,] "potato" "mite" "1914" "authoritative"
## [8,] "fischler" "leveritt" "majors" "neglectful"
## Topic 5 Topic 6 Topic 7 Topic 8 Topic 9
## [1,] "shiffman" "insects" "harding" "colbert" "oxytocin"
## [2,] "whysharksmatter" "saline" "maternal" "peretti" "laurie"
## [3,] "shark" "rowling" "baldness" "opioids" "patti"
## [4,] "sharkweek" "mantises" "kassebaum" "brenna" "towel"
## [5,] "mitt" "mantis" "semitic" "poetry" "mitt"
## [6,] "romney" "stereoscopic" "gaba" "gaba" "romney"
## [7,] "1914" "mites" "mitt" "mitt" "1914"
## [8,] "gesture" "gaba" "romney" "romney" "majors"
## Topic 10 Topic 11 Topic 12 Topic 13 Topic 14
## [1,] "comet" "nsa" "wiseman" "spider" "driscoll"
## [2,] "oed" "wheaton" "casarett" "1776" "mitt"
## [3,] "khattala" "lantern" "gawande" "neurogenesis" "romney"
## [4,] "retiree" "salinger" "chipotle" "murthy" "1914"
## [5,] "meteor" "boko" "underwear" "inconsistencies" "majors"
## [6,] "rosetta" "haram" "nate" "newuscitizen" "ebola"
## [7,] "drones" "mitt" "burrito" "charlebois" "sugar"
## [8,] "mirzoyan" "romney" "greenwell" "ibuprofen" "suisse"
## Topic 15 Topic 16 Topic 17 Topic 18
## [1,] "penguin" "kynect" "yellowstone" "cosplay"
## [2,] "penguins" "semitic" "kalt" "costume"
## [3,] "sweaters" "gentrification" "hangry" "smokey"
## [4,] "brenner" "kantor" "metricmaps" "cosplayers"
## [5,] "overreporting" "suisse" "mckenzie" "mitt"
## [6,] "gaba" "osc" "roser" "romney"
## [7,] "mitt" "americana" "heyburn" "1914"
## [8,] "romney" "danks" "rectangle" "majors"
## Topic 19 Topic 20 Topic 21 Topic 22 Topic 23
## [1,] "languages" "twitch" "herpes" "deboarding" "postal"
## [2,] "oxt" "gamers" "inversion" "cramping" "myers"
## [3,] "crabb" "nintendo" "inversions" "implantation" "gaba"
## [4,] "nom" "gaba" "nyong" "haskins" "balluga"
## [5,] "linguistic" "mitt" "hsv" "vasectomies" "korean"
## [6,] "gust" "romney" "adichie" "rockstar" "katan"
## [7,] "petzel" "1914" "americanah" "lessig" "blake"
## [8,] "absences" "majors" "mitt" "tubes" "nintendo"
## Topic 24 Topic 25 Topic 26 Topic 27 Topic 28
## [1,] "isis" "lgbtq" "hannibal" "corden" "cigarettes"
## [2,] "ebola" "stonewall" "chikungunya" "ebola" "telemedicine"
## [3,] "leone" "cantu" "brett" "fgm" "pager"
## [4,] "maliki" "eeoc" "cher" "tkm" "mitt"
## [5,] "port" "syphilis" "driverless" "specimens" "romney"
## [6,] "assad" "gaba" "mcgurk" "pangaea" "1914"
## [7,] "bitcoin" "mitt" "healy" "receptacle" "majors"
## [8,] "mitt" "romney" "isil" "supercontinent" "ebola"
## Topic 29 Topic 30 Topic 31 Topic 32 Topic 33 Topic 34
## [1,] "dove" "bikeshare" "implants" "bitcoin" "revolving" "ferrets"
## [2,] "powell" "dinosaur" "polio" "textisms" "reactors" "lichter"
## [3,] "autoimmune" "comcast" "syphilis" "texting" "bieber" "artesia"
## [4,] "executions" "fossils" "deaf" "grammar" "gomez" "ferret"
## [5,] "injections" "lucentis" "scissors" "depict" "anthrax" "gaba"
## [6,] "mcauliffe" "avastin" "jawbone" "mitt" "pandas" "mitt"
## [7,] "mcelwee" "coal" "cre" "romney" "annotated" "romney"
## [8,] "ldh69u18oc" "specimens" "butterfly" "1914" "coal" "1914"
## Topic 35 Topic 36 Topic 37 Topic 38 Topic 39
## [1,] "oliver" "walkable" "elevator" "inhumans" "del"
## [2,] "majors" "singh" "beers" "pandas" "rey"
## [3,] "leech" "dinger" "zook" "panda" "lana"
## [4,] "leeches" "misdiagnoses" "solange" "bao" "francis"
## [5,] "iau" "kantar" "zenko" "vin" "mitt"
## [6,] "aerosmith" "pitchforks" "yellen" "mitt" "romney"
## [7,] "comey" "powerless" "asat" "romney" "1914"
## [8,] "sturm" "fuchsia" "icloud" "1914" "gesture"
## Topic 40 Topic 41 Topic 42 Topic 43
## [1,] "benghazi" "francis" "sovaldi" "takei"
## [2,] "lucid" "epigenetic" "bucket" "cte"
## [3,] "dreaming" "epigenetics" "shoup" "initials"
## [4,] "everytown" "spellers" "realtor" "mckenna"
## [5,] "neyestani" "mitt" "putin" "taylorswift13"
## [6,] "cartoon" "romney" "phonebook" "hiroshima"
## [7,] "gongloff" "1914" "interstellar" "orca"
## [8,] "nonmetuammori" "majors" "23andme" "port"
## Topic 44 Topic 45 Topic 46 Topic 47 Topic 48
## [1,] "slater" "daca" "otters" "sugar" "eruption"
## [2,] "napa" "gel" "diploma" "seleka" "¡r"
## [3,] "kael" "octane" "piazza" "mudge" "°arbunga"
## [4,] "tribes" "pichler" "1914" "fructose" "dike"
## [5,] "quake" "snub" "scarborough" "poet" "github"
## [6,] "morgan" "dips" "endowment" "lewismudge" "mitt"
## [7,] "typhoon" "selfies" "dayton" "poetry" "romney"
## [8,] "hurricanes" "beckley" "pho" "carcrisis" "1914"
## Topic 49 Topic 50
## [1,] "coal" "normcore"
## [2,] "ligotti" "pollen"
## [3,] "crossfit" "trademark"
## [4,] "pizzolatto" "redskins"
## [5,] "grimes" "mitt"
## [6,] "mitt" "romney"
## [7,] "romney" "salad"
## [8,] "1914" "gesture"
Depending on the number chosen, the algorithm returns a set of topics extracted from the texts. An interesting example we found is that the topic of an article related to Chile was extracted quite accurately.
As we said earlier, the chosen line of work matches the title: a sentiment analysis of particular themes or concepts (covered in the different articles on the VOX platform) obtained by clustering them, thereby using this model as the categorization method. The project can be divided into the following steps (which include the processes used in Hito 2):
First, we import all the required libraries:
library(readr, warn.conflicts=F, quietly=T)  # data import
library(RCurl)
library(XML)
library(wordcloud)
library(cluster)
library(tm)         # corpus handling and document-term matrices
library(RTextTools)
library(stringi)
library(proxy)      # distance measures (including cosine)
library(tidytext)   # tokenization and sentiment lexicons
library(tidyverse)
library(glue)
library(stringr)
Next, we extract the data to be used, dropping rows that contain any NA and limiting the dataset size for the tests:
articles <- df[complete.cases(df), c(1,8), drop=FALSE]  # all articles with no NA in any column (title and body only)
articles <- articles[1:500,]                            # sub-sample of articles for the tests
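For context, `df` is the full article table loaded earlier in the report; a minimal sketch of that load (the file name is an assumption based on the data.world download):
df <- read_tsv("dsjVoxArticles.tsv")  # assumed file name; title is the first column and body the eighth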
We define a function called GetSentiment, which receives an article as a parameter and returns its overall sentiment (positive minus negative):
# Function to obtain the overall sentiment (count of positive minus negative words)
GetSentiment <- function(article){
  # tokenize the article body into individual words
  tokens <- tibble(text = article) %>% unnest_tokens(word, text)
  # keep only words present in the Bing sentiment lexicon and count them by polarity
  sentiment <- tokens %>%
    inner_join(get_sentiments("bing"), by = "word") %>%
    count(sentiment) %>%                # number of positive and negative words
    spread(sentiment, n, fill = 0)      # wide format: one column per polarity
  if (length(sentiment) == 2) {         # both polarities present
    return(sentiment$positive - sentiment$negative)
  } else if (length(sentiment) == 1) {  # only one polarity present
    if (names(sentiment) == "negative") {
      return(-sentiment$negative)
    } else {
      return(sentiment$positive)
    }
  } else {                              # no sentiment words found
    return(0)
  }
}
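As a quick illustration (not part of the original code), the function can be called on any character string:
GetSentiment("the results were great, but the interface is terrible and confusing")
# returns (# positive Bing words) - (# negative Bing words); here roughly 1 - 2 = -1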
In this step, we loop over each article to clean its body text and to append a column with the overall sentiment computed by the function described above:
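The original chunk for this loop is not shown here; the following is a minimal sketch of what it does, under the assumption that column 2 of `articles` holds the raw body text:
# For each article: strip HTML tags and punctuation from the body, then store
# the overall sentiment in a new column.
articles$sentiment <- NA_real_
for (i in seq_len(nrow(articles))) {
  body <- articles[i, 2]
  body <- gsub("<[^>]+>", " ", body)                 # remove HTML tags
  body <- gsub("[^[:alnum:][:space:]]", " ", body)   # remove punctuation and symbols
  articles[i, 2] <- body
  articles$sentiment[i] <- GetSentiment(body)
}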
Next, the article bodies are converted to corpus format and then to a DocumentTermMatrix (DTM), which is used to compute the distance matrix (and later for clustering). In this code, two values are defined, minTermFreq and maxTermFreq, which bound the acceptable document frequency of each term: a word must appear in at least 1% and in fewer than 50% of the documents to be kept. When building the DTM we again remove stopwords, punctuation and numbers, and restrict the allowed word length. Finally, the DTM is re-weighted with “term frequency-inverse document frequency” (TF-IDF), which uses the global frequency of each word to down-weight terms that appear in most articles (such terms carry little information for characterizing any particular article). With this done, the distance matrix is computed; we use Euclidean distance rather than cosine distance for no particular reason (no meaningful difference was observed when trying both):
corpus = tm::Corpus(tm::VectorSource(articles[,2]))
ndocs <- length(corpus)
minTermFreq <- ndocs * 0.01  # minimum document frequency (drop very rare words)
maxTermFreq <- ndocs * .5    # maximum document frequency (drop extremely common words)
dtm = DocumentTermMatrix(corpus,
                         control = list(
                           stopwords = TRUE,
                           wordLengths = c(4, 15),
                           removePunctuation = TRUE,
                           removeNumbers = TRUE,
                           bounds = list(global = c(minTermFreq, maxTermFreq))
                         ))
dtm <- weightTfIdf(dtm, normalize = TRUE)
dtm.matrix <- as.matrix(dtm)
distMatrix <- dist(dtm.matrix, method = "euclidean")
# distMatrix <- dist(dtm.matrix, method = "cosine")  # alternative: cosine distance (via proxy)
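A quick sanity check, not in the original code, to see how much vocabulary survives the filtering and which terms carry the most TF-IDF weight overall:
dim(dtm.matrix)                                          # documents x retained terms
head(sort(colSums(dtm.matrix), decreasing = TRUE), 10)   # heaviest terms across the corpus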
This part of the code performs the clustering of the documents. We chose DBSCAN for a simple reason: it does not require an approximate number of clusters as a parameter. K-means and hierarchical clustering do, and given the nature of our data it is impossible to estimate that number in advance (it depends entirely on how many articles discuss the same topic, which cannot be known a priori). A density-based search for clusters is therefore more convenient, since it lets us distinguish a noise point (an isolated article) from what could be a cluster (at least two articles with a similar set of words). Accordingly we set minPts = 2, so a cluster is formed from at least two elements, and eps = 0.3, obtained by experimentation (there are methods to estimate a good approximate eps, sketched below, but for lack of time they were not applied rigorously). Once the clustering is done, we keep everything that is not noise and display it, to check that the grouped titles correspond to similar topics.
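One such method, mentioned above but not applied in the original run, is the k-nearest-neighbor distance plot provided by the dbscan package; a minimal sketch (the choice k = 2 follows the usual minPts-based convention):
dbscan::kNNdistplot(distMatrix, k = 2)  # look for the "elbow" in the sorted kNN distances
abline(h = 0.3, lty = 2)                # the eps value used below, for reference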
clustering.dbscan <- dbscan::dbscan(distMatrix, eps=0.3, minPts = 2)  # density-based clustering
articles$cluster <- factor(clustering.dbscan$cluster)    # cluster label per article (0 = noise)
articles <- articles[articles$cluster!=0,c(1,3,4)]       # drop noise; keep title, sentiment, cluster
articlesord <- articles[order(articles$cluster),]        # order by cluster for readability
articlesord
## title
## 2 6 health problems marijuana could treat better than traditional medicine
## 48 The federal government's absurd restrictions on medical marijuana
## 336 6 ways the federal government continues its war on marijuana
## 418 Federal restrictions on pot are under review. Here's what that means.
## 4 Remember when legal marijuana was going to send crime skyrocketing?
## 157 Since Denver legalized pot sales, revenue is up and crime is down
## 5 Obamacare succeeded for one simple reason: it's horrible to be uninsured
## 6 The best Obamacare data comes from a home office in Michigan
## 11 How many people have insurance because of Obamacare?
## 61 Way more than 8 million people have signed up for Obamacare
## 103 Insurers sound pretty happy about Obamacare
## 105 White House: More health spending means Obamacare is working
## 114 Seven things we now know about Obamacare enrollment
## 160 Six reasons Obamacare premiums are going up next year
## 164 What the hell is happening at the VA?
## 174 Meet Obamacare's secret weapon in the war on exorbitant health-care costs
## 203 21 things Obamacare does that you didn't know about
## 214 The VA scandal, explained
## 216 Colorado's director of pot enforcement thinks legalization is going great
## 234 Republicans want to privatize the VA. Veteran groups disagree.
## 250 Screwed-up bonus payments are at the heart of the VA scandal
## 261 11 things most people don't know about health insurance
## 276 Does the Dr. Dre-Apple deal mean hip-hop is selling out?
## 293 Two ways to fix the VA for free — and one that could cost money
## 302 A former Obama advisor: sometimes less health care is better health care
## 316 An interview with Healthcare.gov's new chief executive
## 346 An interview with Dr. James Andrews, the man who's operated on all your favorite athletes
## 361 Washington state loves Obamacare — and still has challenges making it work
## 365 Five ways the American health care system is literally the worst
## 384 Kentucky governor: Without Obamacare, there is no Kynect
## 386 Obamacare's sticker shock is real, but it's not as bad as advertised
## 388 The VA is finally conducting monthly inspections at hospitals and clinics
## 396 Most Obamacare plans will be more expensive next year. Here's why.
## 406 States don't know how they'll pay for year two of Obamacare
## 492 The giant problem American health care ignores
## 35 The $2.8 trillion question: Are health costs growing fast again?
## 356 Orszag: It's time for some optimism about health care spending
## 433 Health spending actually fell while Obamacare insured Americans
## 158 Arkansas and Michigan prove Republicans can compromise on Medicaid
## 169 Another red state just caved on Obamacare
## 188 Republican governors have found something they like about Obamacare
## 253 Why Washington is taking so long to get recreational pot in stores
## 497 Washington shops can now sell weed legally. Too bad there's a pot shortage.
## 397 The Supreme Court just restricted software patents. Here's what that means.
## 400 The Supreme Court doesn't understand software, and that's a problem
## sentiment cluster
## 2 -17 1
## 48 1 1
## 336 -20 1
## 418 -6 1
## 4 -29 2
## 157 -10 2
## 5 -25 3
## 6 14 3
## 11 7 3
## 61 16 3
## 103 6 3
## 105 12 3
## 114 8 3
## 160 8 3
## 164 -51 3
## 174 22 3
## 203 2 3
## 214 -72 3
## 216 47 3
## 234 3 3
## 250 13 3
## 261 0 3
## 276 24 3
## 293 -2 3
## 302 12 3
## 316 39 3
## 346 10 3
## 361 9 3
## 365 14 3
## 384 57 3
## 386 15 3
## 388 -13 3
## 396 6 3
## 406 12 3
## 492 29 3
## 35 4 4
## 356 12 4
## 433 -1 4
## 158 38 5
## 169 33 5
## 188 33 5
## 253 2 6
## 497 -4 6
## 397 -3 7
## 400 3 7
Here we can see each article title and its corresponding cluster. The articles within a cluster are clearly related: for example, cluster 1 contains articles about marijuana, and cluster 3 groups articles about “Obamacare”.
aggregate(sentiment ~ cluster, articlesord, sum)
## cluster sentiment
## 1 1 -42
## 2 2 -39
## 3 3 222
## 4 4 15
## 5 5 104
## 6 6 -2
## 7 7 0
Next, we sum the sentiment of all articles within each cluster to determine its overall balance (whether the topic is, on the whole, discussed positively or negatively). Here we see that marijuana is treated rather negatively, while Obamacare is treated positively.
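Since the clusters differ a lot in size (cluster 3 alone has about thirty articles), the per-cluster mean may be a fairer comparison than the sum; a one-line variant, not in the original analysis:
aggregate(sentiment ~ cluster, articlesord, mean)  # average sentiment per article in each cluster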
Finally, we plot the result obtained from the distance matrix, applying a 2-dimensional projection of the matrix so it can be drawn. In the plot, the articles marked as noise appear in black and each cluster in its own color.
slave.dbscan <- clustering.dbscan$cluster             # cluster labels (0 = noise)
points <- cmdscale(distMatrix, k = 2)                 # 2-D projection of the distance matrix
palette <- colorspace::diverge_hcl(7)                 # creating a color palette
previous.par <- par(mfrow=c(2,2), mar = rep(1.5, 4))
plot(points, main = 'Density-based clustering', col = as.factor(slave.dbscan),
     mai = c(0, 0, 0, 0), mar = c(0, 0, 0, 0),
     xaxt = 'n', yaxt = 'n', xlab = '', ylab = '')    # noise in black, clusters in color
Overall, although the objectives were not fully met, we consider the results satisfactory: the proposed experiments were carried out using the tools covered in the course.
In general, several things remain to be improved, for example:
* Improve and refine the data cleaning. For example, as a classmate suggested, treat HTML tags as relevant information when clustering (e.g. give extra weight to text inside an h1 tag).
* Improve the estimation of the DBSCAN parameters (eps and minPts), which were not chosen very rigorously this time.
* Try other clustering tools, so there are more results to compare against.
* Perform a more precise sentiment analysis, perhaps with a different library that can go beyond a single positive/negative score.
* Assess the quality of the clusters with the metrics seen in the course (see the sketch after this list).
* Optimize the pipeline so it can handle more data (ideally the whole dataset); this time only a small sub-sample of articles could be processed (500 in the run above), since the program would not run with more.
* Improve the visualizations, which are currently rather poor and do not allow much further study.
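A minimal sketch of one such quality check, using the silhouette width from the cluster package already loaded above (noise points, labeled 0, are treated here as just another group, which slightly distorts the score):
sil <- cluster::silhouette(clustering.dbscan$cluster, distMatrix)
summary(sil)            # average silhouette width, overall and per cluster
plot(sil, border = NA)  # widths close to 1 indicate well-separated clusters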